Stable variable ranking and selection in regularized logistic regression for severely imbalanced big binary data

We develop a novel covariate ranking and selection algorithm for regularized ordinary logistic regression (OLR) models in the presence of severe class-imbalance in high dimensional datasets with correlated signal and noise covariates. Class-imbalance is resolved using response-based subsampling which we also employ to achieve stability in variable selection by creating an ensemble of regularized OLR models fitted to subsampled (and balanced) datasets. The regularization methods considered in our study include Lasso, adaptive Lasso (adaLasso) and ridge regression. Our methodology is versatile in the sense that it works effectively for regularization techniques involving both hard- (e.g. Lasso) and soft-shrinkage (e.g. ridge) of the regression coefficients. We assess selection performance by conducting a detailed simulation experiment involving varying moderate-to-severe class-imbalance ratios and highly correlated continuous and discrete signal and noise covariates. Simulation results show that our algorithm is robust against severe class-imbalance under the presence of highly correlated covariates, and consistently achieves stable and accurate variable selection with very low false discovery rate. We illustrate our methodology using a case study involving a severely imbalanced high-dimensional wildland fire occurrence dataset comprising 13 million instances. The case study and simulation results demonstrate that our framework provides a robust approach to variable selection in severely imbalanced big binary data.


Introduction
Robust and efficient selection of covariates in logistic regression models is a core objective of statistical analysis in domains as diverse as epidemiology, forestry and insurance risk assessment as it enhances interpretation of fitted models and leads to improved prediction of the binary outcome (see, for example [1][2][3]). As compared to the ordinary logistic regression (OLR) model combined with traditional variable selection methods such as best subset selection, regularized versions of OLR have the potential to flag important variables in the presence of multicollinearity and are computationally attractive for high dimensional data. Bühlmann and Van De Geer [4] provide an excellent exposition of regularization techniques such as Lasso [5], elastic net [6], group Lasso [7], Dantzig selector [8] and SCAD [9]. Presence of rare events, as manifested through severe imbalance in observed frequencies of the binary classes, along with high dimensional and potentially multicollinear data can lead to serious challenges in deriving stable variable importance rankings in regularized regression models. For instance, it is well known that presence of a large number of irrelevant (noise) covariates having a strong correlation structure with the relevant (signal) covariates can seriously exacerbate variable selection in regularized regression models such as Lasso [10]. Classimbalance occurs when one class is represented by a very small number of examples (minority class) as compared to the other (majority) class, where despite the fact that the majority class makes up most of the cases, it is the minority class that is often relevant for statistical analysis. Rarity of the minority class introduces problems and complications in conducting statistical inference [11,12]. A severe minority-to-majority class ratio, such as in the range of 1:100 to 1:1000, often leads to very high volumes of data (big data) and significantly increases computational cost of model estimation [13].
The issue of severe class-imbalance can be addressed by employing response-based random sampling, which is done by sampling a subset of instances from both classes [14]. Response-based sampling is commonly used in case-control studies in epidemiological research and choice-based studies in econometrics [15][16][17]. Hosmer and Lemeshow [18] provide a thorough treatment of the effect of response-based sampling on estimation of the regression coefficients in logistic regression. A key advantage of response-based downsampling is that it results in substantially reduced datasets which in turn leads to a marked reduction in computational burden in training the models for data with millions of records. We refer the reader to [13,19] for a detailed overview of sampling-based approaches in statistical and machine learning applications involving high class-imbalance in big binary data.
In this study, we employ response-based downsampling to develop a novel variable ranking and selection algorithm in the context of regularized OLR models. Our methodology consists of the following two key features: (i) repeated subsampling of the minority class (controls) to create a large number of balanced datasets (say M), and (ii) computing stabilized aggregate covariate rank scores by generating an ensemble of regularized OLR model fits using the M balanced datasets created via case-control sampling. Our approach is similar in spirit to Bach's [20] Bolasso algorithm which is based on replications of bootstrap sampling and intersecting the supports (i.e. covariates with nonzero coefficients) of the resulting Lasso bootstrap estimates. Bach [20] shows that this approach leads to consistent model selection whereas the original Lasso model estimated from a single dataset is shown to be inconsistent in the presence of strong correlations between signal and noise covariates [21]. Bolasso exploits the fact that bootstrap mimics availability of a large number of datasets by resampling from the same unique dataset and thereby diminishing the probability of selecting noise covariates from the intersected supports. We refer the read to [22][23][24][25][26][27][28] for examples of other recent approaches that involve the general notion of subsampling to perform variable selection in the context of machine learning based classifiers. However, recent variable selection literature, both in the context of machine learning classifiers and other regularized based methods, is focused on analyzing datasets that are moderate in size and with respect to class-imbalance severity. For example, highest sample size and the most severe class-imbalance ratio considered in [22][23][24][25][26][27][28] are 20,000 and 1:42 [26], respectively. In contrast, our methodology is designed to deal with variable selection in very large datasets in a computationally efficient manner, e.g. our case study models involve approximately 5 to 13 million observations with class-imbalance ratios around 1:500.
Here we replace the bootstrap sampling as employed in [20] with repeated response-based subsampling to create a large number of balanced datasets for variable ranking and selection in the presence of severe class-imbalance with potentially large volumes of data and varying degrees of multicollinearity. This study presents an extension to a heuristic variable ranking approach previously introduced in Nadeem et al. [29] who employed it to rank covariates in a large wildland fire occurrence dataset. Our work here is novel in that we rigorously develop the ranking approach, employ it to propose a new variable selection methodology for a general class of regularization methods, and thoroughly assess its selection performance by conducting an extensive simulation study. We demonstrate our framework using three regularization methods including Lasso [5,30], adaLasso [21] and ridge regression [31]. Another important novel element of our stable variable ranking and selection (SVRS) methodology is its flexibility: it does not require hard shrinkage of the regression coefficients, as is the case with Lasso, and works equally well with the general class of regularization techniques that induce soft shrinkage only. We conduct a detailed simulation experiment and use a real-world case study involving big and highly imbalanced wildland fire occurrence datasets to show that, when compared to the usual practice where only a single regularized model fit is employed to determine variable importance, our variable selection approach successfully filters out the noise covariates and recovers a substantially higher proportion of signal covariates.

The regularization methods
Binary response outcomes, observed as a negative (y i = 0) or a positive (y i = 01) occurrence from the i th instance (example), can be modeled using the ordinary logistic regression model with the joint likelihood function for n instances given as: for a given vector of p covariate values, x i . The probability of a positive occurrence is then parametrised via the logistic link function: where α is the intercept term and β is the vector of regression coefficients. The resulting negative log-likelihood function can be expressed as follows: Lasso. Lasso is a commonly employed regularization method for OLR [5,30] where l 1 penalty on β is imposed, leading to the following penalised form of (1): where kbk 1 ¼ P p j¼1 jb j j is the l 1 -norm penalty, λ is a tuning parameter to be estimated separately and covariates typically enter in (2) in standardized form for the penalty to be meaningful. Lasso is a much more popular choice as compared to OLR for high dimensional data in that the l 1 penalty allows simultaneous regularization of the coefficients and model selection by shrinking some of the coefficients to zero. Furthermore, efficient computational algorithms are available for computing the entire solution path for the tuning parameter λ for optimizing (2) [6,30].
However, lasso does have its limitations including lack of robustness to high correlations among covariates and violation of the oracle property. A procedure is said to satisfy the oracle property if it estimates regression coefficients corresponding to noise covariates as zero with probability approaching to 1 and produces asymptotically unbiased and normally distributed estimates of the nonzero coefficients. Lasso can only perform consistent variable selection under strong conditions on the design matrix [21].
Adaptative lasso (adaLasso). The adaLasso was introduced by Zou [21] which unlike the standard Lasso, is consistent in variable selection and satisfies the oracle property. Here the l 1 penalty is modified by incorporating adaptive data-driven weights to the penalty. The adaLasso solution is obtained by maximizing the following penalized form of (1): where w j 0 s are the data-driven weights given as w j ¼ 1=ĵb � j j g ; g is a positive constant andb � j is an initial estimator of β j . Cross-validation can be employed to obtain optimal values of γ and λ from a grid of values where (0.5, 1, 2) are commonly used values of γ [21]. Here we use the standard Lasso solution for initial valuesb � j and set γ = 1 for simplicity of exposition. Ridge regression. Ridge regression [31,32] constraints the regression coefficients using the l 2 norm penalty, j bj j j ¼ P p j¼1 b 2 j , instead of the l 1 penalty employed in (1). It tends to perform particularly well in the presence of strong multicollinearity among many covariates with potentially small effect sizes as it effectively regularizes the coefficients through a trade-off in bias and variance. As opposed to the Lasso solution, ridge regression produces even shrinkage for the correlated covariates [30]. A limitation of ridge regression is that it induces soft shrinkage, i.e. it does not force coefficients to vanish and therefore does not automatically perform variable selection, as is the case with Lasso and adaLasso.
The ridge regression based penalised log-likelihood function takes the following form: The tuning parameter λ in (2-4) can be estimated using various methods including crossvalidation [5], which we employ here in fitting models (2-4) using the R package glmnet [30].

Class-imbalance and response-based sampling
As noted earlier, rarity of one of the binary response classes (usually the positive cases, Y = 1) can lead to significant imbalance in observed frequencies of the two classes. Datasets with severe class-imbalance are often prohibitively large and challenging for training statistical and machine learning models in a computationally tractable manner [33]. In this study we employ a response-based downsampling scheme for controls (Y = 0), where we retain all case instances and draw a simple random sample from controls. Let n be the size of the entire available sample and n 0 and n 1 (n 0 > n 1 ) be the total number of controls and cases, respectively; then n 1 instances are randomly selected from the n 0 controls resulting in a balanced (and therefore reduced) dataset of size n b = 2n 1 . This has the effect of inducing an offset intercept term, log (p 1 /p 0 ), into the full-data linear predictor g p i ð Þ ¼ a þ x t i β, i.e. [18]: where p � i ¼ PrðY i ¼ 1j x i ; s i ¼ 1Þ, s i is selection status (0 or 1) of the i th instance in the original dataset; p 1 and p 0 are selection probabilities of cases and controls, respectively. These probabilities can be determined from respective sampling proportions. In case of the balanced sampling design employed in this study, p 1 = 1 and p 0 is estimated as the ratio of the number of controls to the number of cases in the full dataset.

Stable Variable Ranking and Selection (SVRS) algorithm
Our variable ranking and selection algorithm is based on the following key steps: 1. Start with the full dataset of size n, Z = [z 1 , z 2 , . . ., z n ], where z i ¼ ðy i ; x t i Þ t is the i th observed data vector and Z can be written as: Z = [Z 0 , Z 1 ]; Z 0 and Z 1 denote data partitions corresponding to y i = 0 (control) and y i = 1 (case) observations, respectively.
ii. Generate the balanced dataset of size n b as Z ðbÞ iii. Fit a regularized OLR model (e.g. Lasso) to Z ðbÞ j and store the estimated regression coefficients vector, b β j .
3. Employ the variable ranking algorithm described in the next section to the resulting M × p to compute an aggregate rank score, Rank(X i ), for each covariate X i , i = 1, 2, 3, . . ., p.
4. Choose a subset of variables as the most influential (relevant) covariates by selecting a threshold rank score from sorted rank scores, Rank(X i ).
Next, we describe the variable ranking algorithm and our methodology for selecting a rank threshold for variable selection.
Ranking algorithm. For simplicity of exposition, we drop the hat symbol from the various quantities presented in this section, e.g.b j is replaced by β j . Let us assume that the computed regression coefficients β j are on their original scale of measurement (i.e., not standardized covariates) and let β D denote the corresponding vector of standardized coefficients whose elements, b D i;j , are defined as b D i;j ¼ sd x i ð Þb i;j , where sd(x i ) is the standard deviation of the observed values for the i th covariate. These standardized coefficients correspond to transformed covariates, z k ¼ x k À � x k sdðx k Þ . Also, let (R 1,j , R 2,j , . . ., R p,j ) be the ranks assigned to absolute values of b D i;j in the vector of standardized coefficients, namely jb D 1;j j; jb D 2;j j; . . . ; jb D p;j j � � , where 1 � R i,j � p. Note that the covariates ranked at P and 1 have the highest and lowest values of the absolute standardized coefficients, respectively. After sorting in increasing order we denote resulting vector as ð Þ ;j j and X i h ð Þ is the covariate whose assigned rank value is h. For instance, if X 5 gets assigned a rank value of 10, we have i (10) = 5. Note also that for a given rank position h, X i ðhÞ can differ between model fits for j = 1, 2, . . ., M.
An aggregate rank score based on M fits of the model is given by the following formula: where I A is an indicator function of event A and h indicates the rank positon. Theorem 1. We have: hI R i;j ¼h : Then we must have 1 � P p h¼1 hI R i;j ¼h � p for a given j. So, the average over the index j must also satisfy the same constraints.
ii) Consider the sum For regularization methods with hard shrinkage, such as Lasso logistic regression, it is common for several of the estimated coefficients to be forced to zero values at convergence as a consequence of the l 1 regularization penalty. When displaying a ranking of estimated coefficients based on Lasso logistic fit, these zeroed-out coefficients need not be included in the ranking as they have no effect on predictive performance. A modified version of (1) is therefore presented as follows.
Let x � be a covariate having a zero coefficient in some j th Lasso fit. Suppose x � is ranked at 10 in a particular fit, then its rank contribution in (1) from this fit is h = 10; even though its estimated coefficient is 0. However, we should penalize x � for this as it was in fact dropped from selection in that j th fit. This is achieved by restructuring (1) as follows: where Rank(X i ) � p, and P p i¼1 Rank X i ð Þ � P p h¼1 h. A related scenario can also arise when several covariates are repeatedly zeroed-out in all M model fits, causing all the regression coefficients falling within a certain rank position to become zero. This leads us to the following definitions and a related theorem.
Definition 1 (supported variables): The set of supported covariates is given as Essentially, χ consists of all covariates that were not dropped in at least of the M model fits. For soft-shrinkage regularization such as in Ridge regression, we typically have χ = x.
Definition 2 (effective maximum rank): The effective maximum rank is given as: This also implies that jb D ih þ1 ð Þ ;j j must all be zero because, for any given j, we have by definition: Therefore, considering Theorem 2, we further modify (7) as follows.
where X i 2 χ, and the second summation on the right is taken over the first p EF columns ofB. Here, (3) now satisfies the properties: where Rank(X i ) = 0 for X i 2 χ c . We use (8) as a basis for ranking covariates arising from the model fits. Note that (8) is equivalent to (6) when χ = x.
Threshold selection. Acquiring a list of covariates ranked by their relative importance obtained from the algorithm described above only completes the task of computing stable variable rank scores. Our objective here is to determine how many and which of these covariates are influential predictors in the logistic regression model. One simple approach to selecting important covariates is to examine selection metrics for choosing a threshold that partitions the covariates in influential and irrelevant sets. For instance, Nadeem et al. [29] used a selection metric, p drop , to determine the selection threshold. The metric p drop is defined as: that is, it denotes fraction of times the i th covariate is dropped from Lasso logistic model fits. Nadeem et al. [29] then plotted p Drop i values and noted that it tends to have a change point which they determined visually by choosing a threshold to separate the variables in clusters of important and irrelevant covariates (see Fig 8 in Nadeem et al. [29]). Notice that the notion of the p drop metric used by Nadeem et al. [29] for variable selection is similar in spirit to the concept of empirical selection probability as introduced by Meinshausen and Bühlmann [34]. However, here we introduce Rank(X i ) as the selection metric and show that it works well for both Lasso and other regularization methods that does not enforce hard shrinkage of the covariates.
We employ automatic thresholding of the rank scores based on change-point detection methodology [35][36][37]. Suppose the rank scores are ordered as a sequence of values (r 1 , r 2 , . . ., r v ) and if there exists an index τ 2 (1, 2, . . ., v − 1), such that some feature (e.g. mean) in the probability distribution of (r 1 , r 2 , . . ., r τ ) and (r τ+1 , r τ+2 , . . ., r v ) differ, then a change point has occurred. We employ the changepoint package [38] in R, to find a single change point in the mean Rank(X i ) values. We refer the reader to Killick et al. [39] for further details on the detection tests implemented in the changepoint package.
Computational efficiency. Here we provide an illustration of computational efficiency of the SVRS algorithm in the context of the Lasso regularization method. Similar gains in efficiency are expected under other regularization techniques, such as ridge regression, due to the use of response-based subsampling. Computational complexity of Lasso algorithm is O(np 2 + p 3 ) [34] for n > p case considered in this study. The regularization parameter is selected in practice using K-fold cross-validation, which results in a computing time of O (np 2 (K − 1) + p 3 ). On the other hand, the cost for a single balanced dataset with K-fold cross-validation is O(2n 1 (K − 1)p 2 + p 3 ), where n 1 is the total number of cases in the entire dataset. Computational complexity for SVRS is therefore O(2n 1 M(K − 1)p 2 + p 3 ), when Lasso model is fitted to M balanced datasets. As we demonstrate in our case study analyses later, M = 100 is a reasonable choice for very large imbalanced datasets. Hence, algorithmic complexity of SVRS scales with the size of the minority class only. For instance, one of the datasets analyzed in our case study has n = 10.7 million and n 1 = 22,525. Therefore, fitting the Lasso model to the entire original data would require approximately 238 times as many computational resources as needed to analyze a single balanced dataset. Whereas fitting the model to full data would still cost about 2.4 times more resources than running our SVRS algorithm.
It is however crucial to note that fitting regularized versions of the OLR model to massively large datasets is often computationally infeasible due to restricted amount of memory (RAM) available on computers. This issue is especially exacerbated for commonly used analysis languages such as R. The SVRS algorithm is therefore attractive in the sense that: i) it circumvents the computational bottleneck by making it feasible to estimate the model from much smaller subsets of the original data, and ii) its implementation is highly parallelizable which yields further gains in computational efficiency. For the abovementioned case study example, parallelized implementation of Lasso based SVRS algorithm (M = 100, p = 82) only took about 7.5 minutes on a 3.4-GHz machine with 16 cores, using glmnet R package [30].

Simulation experiment
We simulate data under (1) using four OLR models with varying number of signal and noise covariates where two of them include correlated noise and signal covariates, while the remaining two models do not possess correlated covariates. Model descriptions along with respective ratios of number of signal to number of noise covariates, s: m; and regression coefficient vectors, are reported in Table 1.
All covariates under UNCOR-12 and UNCOR-24 are distributed as independent random variables. We also apply logistic transformation to the normally distributed covariates so that their support is the unit interval, [0,1]. This was done to ensure that the covariates scales are relatively consistent, and the magnitudes of regression coefficients (Table 1) reflect relative variable importance in the simulated OLR model. Distribution and correlation structure of the covariates under COR-12 and COR-24 simulation models are described in the subsection ahead.
Simulation of models with correlated covariates. We include both categorical and continuous correlated covariates in the COR-12 and COR-24 simulation models. Touloumis [40] provide a computational framework to simulate correlated clusters of categorical variables using marginal baseline-category logit models [41]. They describe and implement the NORmal-To-Anything (NORTA) method introduced by Cario and Nelson [42] for simulating correlated binary and nominal responses under a desired marginal model specification. We employ R package SimCorMultRes [40] to simulate data and use the NORTA method to generate data for categorical responses which we incorporate in our simulation models as categorical covariates.
Data generation process under COR-12 and COR-24 models is described as follows where the first s covariates (X 1 , X 2 , . . ., X s ) are signal and the rest are all noise (Table 1).

COR-12. We generate four clusters of a 4-category nominal variable
Þ denote the four dummy variables (i.e. X 1 + X 2 + X 3 + X� 1 = 1) corresponding to the categories of Z c,1 in cluster 1. Similarly, we have Z t c;2 ¼ X 4 ; X 5 ; X 6 ; X �2 ð Þ and Z t c;3 ¼ X 13 ; X 14 ; X 15 ; X �3 ð Þ for cluster 2 and 3 respectively, where Z t c;3 contains noise covariates. Here, Z t c;i are all correlated, and we do not use X� i to avoid perfect multicollinearity among the four dummy variables (e.g. X �1 ¼ P 3 i¼ X i ). We do not also incorporate Z t c;4 in the COR-12 simulations. We further generate a separate set of correlated binary covariates (X 16 , X 17 , X 18 ) independently from the categorical covariates.
Generation of balanced datasets. We create balanced datasets using response-based sampling performed on initial datasets (of size n) that are generated under the four simulation models with three class-imbalance ratios (I R ) and increasing sizes (n b ) of the balanced samples. Combinations of various (I R , n b ) values considered in our simulation experiment along with initial sample sizes (n) required to ensure various imbalance ratios are reported in Table 2. We induce rarity of the positive class by incorporating appropriately small values of the intercept, α, in the simulation models. We generate M = 500 balanced datasets under each of the resulting 60 combinations involving four models, three I R values and five n b values (Tables 1 and 2). Lasso, adaLasso and ridge regression are then fitted to each balanced dataset to generate the estimated coefficient vectors, b β j ; j ¼ 1; 2; . . . ; 500 (see SVRS algorithm steps above).   model with varying imbalance ratios (I R ). We notice that distribution of simulated probability for the original imbalanced data is highly skewed towards zero (i.e. cases are rare), where degree of skewness exacerbates as I R becomes severe (Fig 1A). However, balancing the sample removes the skewness and renders a distribution that is invariant across imbalance ratios ( Fig  1B). We observed similar characteristics in other simulation models as well.

Results
The  all simulation scenarios. We find that the individual model fits have reasonable skill in predicting the binary response variable based on the mean cross-validated AUC scores corresponding to the optimal value of the tuning parameter, denoted here as λ min . Notice that all results presented in this section are based on λ min .
Selection performance with un-correlated data UNCOR-12. Fig 3 depicts TPR, FPR, FDR and AUC values obtained from the SVRS algorithm and corresponding distribution of scores across the individual fits for the various regularization methods, imbalance ratios (I R ) and balanced sample sizes, n b . Apart from the considerable variation in individual TPR scores for Lasso at the smallest sample size (n b = 1000), both Lasso and adaLasso scores are stable and close to 1 irrespective of the sample size and I R . It is also evident that SVRS TPR scores are near perfect for the three regularization methods in all scenarios. The individual FPR and FDR scores for Lasso and ada-Lasso are however much more variable with median values dropping as a function of n b . The highest mean FPR score for the Lasso and adaLasso was 0.191 and 0.176 respectively, which is approximately 17 and 15 noise covariates selected on average (Table 3). Similarly, the maximum FDR values ranged from 0.261 to 0.521, and 0.218 to 0.504 for Lasso and adaLasso respectively, showing that a substantially large proportion of selected covariates from an individual fit can be false positives. The SVRS FPR and FDR scores across the two Lasso methods on the other hand range from 0 to 0.023, and 0 to 0.143 respectively, across all scenarios (Table 3). In comparison, ridge regression registers higher SVRS FPR and FDR values ranging in 0 to 0.136 and 0 to 0.520 respectively. The suboptimal performance of the individual across Lasso and adaLasso fits is also clear from viewing the large variability in AUC scores (Fig 3), with poor scores observed in all scenarios. SVRS algorithm improves the selection performance by a substantial margin with AUC scores ranging in 0.958 to 1 for the two Lasso methods. SVRS based AUC scores for Ridge regression are comparatively lower due mostly to worse performance in terms of FPR and FDR.
UNCOR-24. Distributions of TPR values for the two Lasso methods reveal that individual fits tend to recover a much lower proportion of signal covariates under UNCOR-24 when compared to UNCOR-12; however, mean TPR increases with sample size (Fig 4; Table 4). This is explained in part from noticing that UNCOR-24 has a substantial proportion of covariates with relatively small effect sizes (|β i | < 0.35; Table 1) that are hard to detect at a lower sample size. The SVRS TPR scores on the other hand are generally above the 25 th percentile of the individual scores, especially for higher sample sizes. Ridge regression shows further improvement in terms of SVRS TPR scores. Behavior of FPR and FDR scores based on individual fits and SVRS algorithm is similar to that of under UNCOR-12 with one exception: scores do not improve much with the increasing sampling size. It is evident from Fig 4 that SVRS AUC scores are generally higher than the median score from the individual fits, which shows that SVRS is superior at achieving stable ranking and selection as compared to individual fits.  0.847 to 1 (Table 5). Likewise, SVRS tends to recover most or all signal covariates despite the severe class-imbalance and high correlation between the signal and noise covariates (TPR > 0.83 for n b > 1000; Table 5). Even at n b = 1000, SVRS TPR values are higher than the 25 th percentile of the individual score distributions in most cases (Fig 5). Performance in terms of FPR and FDR is again similar to the pattern seen under the un-correlated simulation models as individual Lasso and adaLasso scores are much worse with high variability as compared to SVRS scores. Ridge regression has substantially higher SVRS FPR and FDR scores. We see that the two Lasso methods have similar performance in terms of AUC scores, which is far superior as compared to the individual fits, with values range from 0.875 to 1 across the two methods. On the other hand, SVRS AUC scores for ridge regression are relatively smaller owing to large number of higher false positive discoveries. COR-24. COR-24 is the most involved simulation model including several highly correlated candidate covariates (Table 1) where the correlation structure spans across signal and noise variables. Here we see that the mean TPR scores for Lasso/adaLasso individual fits range from 0.740 to 0.921 with high variability in the score distribution (Fig 6). On the other hand, SVRS based TPR values exceed the 25 th percentile of the individual score distribution in most cases. It is evident that SVRS algorithm with Lasso/adaLasso excels at filtering out the noise covariates with very low FPR and FDR scores ranging from 0 to 0.034 and 0 to 0.217, respectively. Individual Lasso/adaLasso fits in comparison do not provide a reliable selection tool as they tend to include a very large proportion of noise covariates in the selection set (Fig 6;  Table 6; FDR range: 0.349 to 0.485). Ridge regression has near perfect SVRS TPR values in all cases, but with much higher FPR and FDR scores as compared to Lasso and adaLasso. However, SVRS based selection performance of the three regularization methods is quite similar in terms of AUC scores with Ridge regression registering slightly higher scores (Fig 6;  Table 6). We emphasize that a majority of SVRS AUC scores for Lasso and adaLasso lie beyond the 75 th percentile of the individual score distributions across all scenarios. Overall, our results clearly demonstrate that SVRS algorithm renders a marked improvement over individual fits by stabilising the selection variability inherent in those fits.

Case study: Analysis of wildland fire occurrence data
In this section, we consider a large wildland fire dataset compiled by Nadeem et al. [29] comprising fire occurrence records by ignition source (human-and lightning-caused) and a suite of explanatory variables on a fine spatial grid in British Columbia, Canada, during 1981-2014. Wildland fires account for an average of 2.5 million hectares of burnt area across Canada annually, costing the government of Canada up to $1.5 billion per year in fire suppression and management [43]. The province of British Columbia is a major contributor to daily fire load in Canada, as 70% of its land area contains coniferous forest or grasslands that are potentially flammable [29]. Anthropogenic climate change is attributed to further exacerbate the risk of extreme fire-weather conditions in recent years [44]. For instance, a record 1.2 million ha of wildland burned during the historically extreme 2017 wildfire season alone [45]. It is therefore crucial to develop robust statistical models that enable identification of key underlying drivers of fire occurrence process and predicting daily ignitions across the province for fire management preparedness and detection planning. Response variable in Nadeem et al.'s [29] dataset is a Bernoulli random variable Y (indicator variable for at least one ignition) observed on a space-time voxel that represents a 24-h time period for a 20×20-km cell in the National Forest Inventory (NFI) grid [46]. There are approximately 13 million 1-day space-time voxels spanning 2541 grid cells and 34 fire seasons  where human-and lightning-caused fire occurrences were observed in only 0.23% and 0.18% of the voxels respectively, which highlights the severity of class-imbalance in the observed distribution of Y. A large number of candidate covariates were compiled including geographic, vegetation, ecumene, surface fire weather, atmospheric stability and other derived variables (see Table 1 in Nadeem et al. [29]). We refer the reader to Nadeem et al. [29] for more details about the study area and methods used in data collection and compilation.
Nadeem et al. [29] employed the lasso-logistic modelling framework to develop the following three fire occurrence models: i) a human-caused fire (HCF) model; ii) a lightning-caused fire (OLCF) model which included observed cloud-to-ground lightning strikes as a candidate covariate, and iii) a lightning-caused fire (PLCF) model which did not include the lightning strike covariate but included atmospheric stability covariates as proxies for lightning strikes. Note that, OLCF did not include the atmospheric stability covariates. The PLCF and HCF models were trained on data from 1981-2008 (10.69 million observations), leaving the years 2009-2014 for testing the fitted models. The OLCF was trained on data from years 1999-2008 and 2010-2013 (5.12 million observations), leaving the years 2009 and 2014 for testing. Number of cases in the training dataset under HCF, PLCF and OLCF were 22,525; 24,392 and 10,128; respectively. The case-control class-imbalance ratio for the HCF, PLCF and OLCF models were therefore approximately 1:469, 1:433 and 1:504, respectively. Nadeem et al. [29] employed response-based sampling to handle this severe class-imbalance and used p drop to perform variable selection. We emphasise that our selection methodology in this study is instead based on Rank(X i ), which works for both Lasso and Ridge regularization.
We illustrate our SVRS algorithm by reanalyzing HCF, PLCF and OLCF logistic regression models using both Lasso and Ridge regularization where we also permute a subset of variables that were deemed unimportant in Nadeem et al. [29]. Permuting a covariate destroys its relationship with the observed response vector and is widely implemented to assess variable importance in regression and classification models, e.g. random forests [47]. Here, barring the 36, 32 and 27 highest ranked covariates in HCF, PLCF, and OLCF models respectively, as reported in Nadeem et al. [29], all the remaining candidate variables are permuted before implementing the SVRS algorithm. Total number of candidate variables included in HCF, PLCF, and OLCF are 82, 69 and 67 respectively (Table 7). Figs 7 and 8 plot Rank(X i ) scores for the various models along with selection thresholds computed via the changepoint detection method. We also assess change in selection performance with low and high number of downsampled balanced datasets. We further consider the effect of choosing the optimally regularized model as compared to the more conventional practice of choosing a parsimonious model corresponding to tuning parameter (λ) value that satisfies the one-standard-error rule [48][49][50]. Following the terminology used in the glment R package, we denote the λ values corresponding the best and one-standard-error based models as λ min and λ 1se in Figs 7 and 8 and Table 7.
Regardless of the choice of the tuning parameter and number of balanced datasets (M) used, selected threshold values for HCF, PLCF and OLCF models in nearly all cases show that there is a marked change (or a breakpoint) in sorted Rank(X i ) scores which naturally separates the candidate covariates in two mutually exclusive sets of important (signal) and unimportant (noise) covariates (Figs 7 and 8). It is also evident that the changepoint method accurately detects these breakpoints as shown by the symbols ×. The key finding here is that all permuted covariates are classified as unimportant under Lasso and Ridge regression for λ min and λ 1se ( Table 7; M = 500), providing strong evidence that our variable selection approach achieves very low FPR and FDR scores for complex real-world datasets. Notice that this result remains unchanged with M = 100 for both Lasso and Ridge regularization methods. On the other hand, FDR and FPR scores computed from the individual Lasso fits are at unacceptably high levels as shown in Fig 9 even when λ 1se is used to force a stronger penalization of the regression coefficients. This finding is inline with the behavior observed in the simulation experiments where individual fit based FDR and FPR values for Lasso/adaLasso were highly elevated under all simulation models. Furthermore, we found that a large proportion of unpermuted covariates are deemed important which agrees with the variable selection results obtained by Nadeem et al. [29] based on Lasso-logistic modeling of the original unperturbed dataset  (Table 7). We refer the reader to Nadeem et al. [29] for a detailed discussion of the relevance of selected covariates to the wildland fire occurrence process in British Columbia, Canada.

Discussion
This study presents a novel variable ranking and selection method, SVRS, for regularized logistic regression in the context of severely imbalanced and potentially massive and high-dimensional datasets. It consists of three basic components: i) a base regularization method, e.g. Ridge regression, that outputs regularized estimates of regression coefficients, ii) responsebased sampling to generate an ensemble of regularized coefficients by fitting the base algorithm to several balanced datasets, and iii) an algorithm to stabilize the variable rank scores from the ensemble to select variables having with high rank scores. The method is very flexible as we demonstrate its applicability for regularization methods that enforce both hard-and soft-shrinkage of the regression coefficients. The simulation experiments show that methods with built-in structure estimation, such as Lasso, can produce highly instable and misleading selection results for high-dimensional data. Analysis of permuted wildland fire occurrence data further reveals that these methods can fail spectacularly at controlling the false discovery rate. SVRS, on the other hand, stabilizes the noise in estimated regression coefficients and yields variable selection with high accuracy and very low false discovery rate.
Another potential general regularization framework for logistic regression is the elastic net penalty [6], which includes Lasso and Ridge penalties as the special case. It involves an additional mixing parameter, 0 � α � 1; where values of 0 and 1 correspond to Ridge and Lasso penalties, respectively. Here, α induces a tradeoff between l 2 -norm (Lasso) and l 1 -norm (ridge) and works well in the presence of strongly correlated covariates. We studied the performance SVRS algorithm with elastic net in a small simulation experiment under various settings as defined in Tables 1 and 2 and found that the results (not reported here) were similar to Lasso and Ridge cases examined herein. We therefore opted to focus on the two widely applicable special cases of elastic net to simplify the exposition of our methodology.
Use of response-based downsampling in our work is an instance of the general notion of subsampling and is particularly useful in reducing the computational burden when dealing with prohibitively large datasets with extreme class-imbalance. For instance, the full BC wildland fire occurrence dataset for person-caused fires comprises 13 million instances with a case-control class-imbalance ratio of 1:469, whereas a single balanced dataset generated via response-based sampling consists of 45,050 observations only. The methodology introduced herein is in the same vein as stability selection [34,51], which is a general subsampling based variable selection approach that is designed to improve selection performance of a base selection algorithm for high-dimensional data. However, SVRS differs from stability selection in the following two aspects: i) it stabilizes the regression coefficients across the ensemble whereas the latter is designed to stabilize the selections (i.e. classification of a variables into noise or signal group) performed on subsampled data; and ii) response-based sampling in SVRS ensures that the size of the subsamples can potentially be several hundreds of order of magnitude smaller than [n/2], which is the required size to implement stability selection. This latter aspect is crucial in the context of severely imbalanced and massive datasets as training the base algorithm repeatedly on millions of observations can be computationally prohibitive. It is also important to note that response-based sampling approach in SVRS alleviates class-imbalance in individual subsamples without affecting the predictive performance of the base regularization method, whereas the usual random sampling as implemented in stability selection inherits the same severe class-imbalance present in the original dataset.
In summary, this study introduces a new variable selection method for logistic regression modeling of extreme rare events data. The method combines response-based subsampling and commonly employed regularization methods to perform accurate variable selection for highdimensional and large datasets. The performance results are supported by an extensive simulation experiment and analysis of big and severely imbalanced real-life datasets.
Supporting information S1 Appendix. Analysis of a reduced case study dataset. (DOCX)